class: center, middle, inverse, title-slide

# Branching Out Into Isolation Forests
## R-Ladies Dallas
### Stephanie Kirmer
www.stephaniekirmer.com
@data_stephanie
### December 7, 2020

---

# Follow Along!

https://github.com/skirmer/isolation_forests

---

# Introduction

Isolation forests are a method that uses tree-based decision-making to separate observations instead of grouping them. You might visualize this in tree form:

<img src="../IsolationForest1.png" alt="diagram1" width="600"/>

---

# Introduction

If you prefer to think about the points in two-dimensional space, you can also use something like this:

Here you can see that a highly anomalous observation is easily separated from the bulk of the sample, while a non-anomalous one requires many more steps to isolate.

---

# Getting Started

Today we are going to implement this modeling approach using a sample of data from Spotify: song characteristics.

We'll be using these libraries:

* **modeling**: isotree
* **visuals**: ggplot2, plotly, patchwork

---

# Load Data

From Kaggle - tracks on Spotify
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv

Let's identify really unusual tracks! What characteristics do we have?

```
##  [1] "acousticness"     "artists"          "danceability"     "duration_ms"     
##  [5] "energy"           "explicit"         "id"               "instrumentalness"
##  [9] "key"              "liveness"         "loudness"         "mode"            
## [13] "name"             "popularity"       "release_date"     "speechiness"     
## [17] "tempo"            "valence"          "year"
```

---

# Looking at Examples

Time to look at examples! Let's see some highly instrumental tracks.

```r
head(dataset[dataset$instrumentalness > .94, c("artists", "name", "year")], 5)
```

```
##                                        artists
## 21 ['Moritz Moszkowski', 'Vladimir Horowitz']
## 23   ['Frédéric Chopin', 'Vladimir Horowitz']
## 27                            ['Hafız Yaşar']
## 42 ['Dmitry Kabalevsky', 'Vladimir Horowitz']
## 49                      ['Shungi Music Crew']
##                                           name year
## 21             Etude in A-Flat, Op. 72, No. 11 1928
## 23    Andante spianato in E-Flat Major, Op. 22 1928
## 27                                Kız Saçların 1928
## 42 Sonata No. 3, Op. 46: II. Andante cantabile 1928
## 49                                     Rumours 1928
```

---

# Looking at Examples

What about very "speechy" ones?
I'm choosing records after 1965 so we'll see some that we might recognize.

```r
set.seed(426)
speechy = dataset[dataset$speechiness > .9 & dataset$year > 1965, c("artists", "name", "year")]
speechy[sample(nrow(speechy), 3), ]
```

```
##                 artists                               name year
## 23123 ['John Mulaney']      Blacking Out and Making Money 2009
## 46975 ['John Mulaney'] Law and Order and Mr. Jerry Orbach 2009
## 76943 ['John Mulaney']                         Crime News 2009
```

---

# Looking at Examples

Finally, let's poke at the loud ones.

```r
set.seed(400)
loud = dataset[dataset$loudness > .85, c("artists", "name", "year")]
loud[sample(nrow(loud), 3), ]
```

```
##                 artists                              name year
## 93132   ['The Stooges'] Search and Destroy - Iggy Pop Mix 1973
## 108250 ['Apocolothoth']                              Sold 1936
## 127899  ['DYING SPASM']                              drag 1944
```

Yeah, that tracks! These examples tell us something about the kinds of songs in the data.

---

# Modeling

---

# Feature Engineering

This is going to be very minimal, because we want to get right to the model. One thing I'm doing is binning the years and cutting off songs before 1960, just because the data is a little different before that time. We want to find songs that are truly unusual, not just artifacts of odd data collection.

```r
dataset = dataset[dataset$year > 1960,]

b <- c(-Inf, 1970, 1980, 1990, 2000, 2010, Inf)
names <- c("60s", "70s", "80s", "90s", "00s", "10s to present")
dataset$year_bin <- cut(dataset$year, breaks = b, labels = names)

table(dataset$year_bin)
```

```
## 
##            60s            70s            80s            90s            00s 
##          20000          20000          20000          20000          20000 
## 10s to present 
##          19656
```

---

# Parameter Setup

---

# Function Syntax

We don't need to set any outcome or dependent variable, because that is not the objective of this algorithm.
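The model call on the next slide references `features`, `dim`, `trees`, and `max_depth`, plus a `training_set`/`test_set` split, none of which are defined on these slides. A minimal sketch of that setup might look like the following — the data frame and every parameter value here are illustrative stand-ins, not the talk's actual settings (the real data comes from the Kaggle CSV, and `library(isotree)` is also needed before fitting):

```r
# Stand-in for the Spotify data; all values below are illustrative guesses
set.seed(426)
dataset <- data.frame(
  tempo       = rnorm(1000, 120, 25),
  speechiness = runif(1000),
  loudness    = rnorm(1000, -10, 4)
)

features  <- c("tempo", "speechiness", "loudness")
dim       <- 2    # ndim: columns combined per split (extended isolation forest)
trees     <- 100  # ntrees: number of trees in the forest
max_depth <- 10   # cap on the depth of each tree

# Simple train/test split
train_idx    <- sample(nrow(dataset), 0.7 * nrow(dataset))
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]
```

Setting `ndim` above 1 is what makes this an *extended* isolation forest: each split cuts on a random linear combination of columns rather than a single one.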
```r
iso_ext = isolation.forest(
  training_set[, features],
  ndim = dim,
  ntrees = trees,
  nthreads = 1,
  max_depth = max_depth,
  prob_pick_pooled_gain = 0,
  prob_pick_avg_gain = 0,
  output_score = FALSE)

Z1 <- predict(iso_ext, training_set)
Z2 <- predict(iso_ext, test_set)

training_set$scores <- Z1
test_set$scores <- Z2
```

---

# Peeking at Results

```
##                                    artists
## 92291  ['Herb Alpert & The Tijuana Brass']
## 7231                      ['Spawnbreezie']
## 135399         ['ILLENIUM', 'Jon Bellion']
## 61001            ['Bonnie "Prince" Billy']
## 10548                  ['Sammy Davis Jr.']
## 12461                          ['The The']
##                                             name year    scores
## 92291                              Whipped Cream 1965 0.4596230
## 7231                                Don't Let Go 2011 0.4429261
## 135399 Good Things Fall Apart (with Jon Bellion) 2019 0.4415915
## 61001                           I See A Darkness 1998 0.4633557
## 10548                                 Not for Me 1964 0.4405995
## 12461                                Soul Mining 1983 0.4540688
```

---

# Peeking at Results

```r
ggplot(training_set, aes(x = scores)) +
  theme_bw() +
  geom_density()
```

<img src="isoforests_files/figure-html/unnamed-chunk-10-1.png" width="600" />

---

# Peeking at Results

```r
training_set$anomaly = ifelse(training_set$scores > .52, "Anomaly", "Normal")

ggplot(training_set, aes(x = tempo, y = speechiness, group = anomaly, color = anomaly)) +
  theme_bw() +
  geom_point(alpha = .75) +
  labs(title = "Training Sample Score")
```

<img src="isoforests_files/figure-html/unnamed-chunk-11-1.png" width="650" />

---

# Peeking at Results

```r
test_set$anomaly = ifelse(test_set$scores > .52, "Anomaly", "Normal")

ggplot(test_set, aes(x = tempo, y = speechiness, group = anomaly, color = anomaly)) +
  theme_bw() +
  geom_point(alpha = .75) +
  labs(title = "Test Sample Score")
```

<img src="isoforests_files/figure-html/unnamed-chunk-12-1.png" width="600" />

---

# Peeking at Results

<img src="isoforests_files/figure-html/unnamed-chunk-13-1.png" width="650" />

---

# PCA

```r
trainingpca <- prcomp(training_set[, features], scale.
= T)

std_dev <- trainingpca$sdev
pr_var <- std_dev^2
prop_varex <- pr_var / sum(pr_var)

plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")
```

<img src="isoforests_files/figure-html/unnamed-chunk-14-1.png" width="650" />

```r
trainingpca = data.frame(
  training_set[, c("artists", "name", "year", "scores", "anomaly")],
  trainingpca$x)
head(trainingpca[, c(1:8)])
```

```
##                                            artists
## 83108                       ['The Isley Brothers']
## 114731                               ['Four Tops']
## 116451 ['Tory Lanez', 'Rich The Kid', 'Lil Wayne']
## 7928                          ['RL Grime', 'Daya']
## 69589                               ['Calibre 50']
## 66905                                ['Tom Waits']
##                                                          name year    scores
## 83108                                Love the One You're With 1971 0.4421467
## 114731         It's The Same Old Song - Single Version / Mono 2000 0.4528982
## 116451 TAlk tO Me (with Rich The Kid feat. Lil Wayne) - Remix 2018 0.4615679
## 7928                                             I Wanna Know 2018 0.4526256
## 69589                                   Callejero Y Mujeriego 2010 0.4529574
## 66905                                               Rainbirds 1983 0.4848750
##        anomaly        PC1        PC2        PC3
## 83108   Normal  1.1172524 -0.7167091 -0.9102028
## 114731  Normal  2.2657133 -1.2966916 -0.6690133
## 116451  Normal  2.1163506  1.4961194 -0.4218515
## 7928    Normal  0.3965404  0.8498761  0.7482969
## 69589   Normal  2.3147320 -1.4823714  0.3584605
## 66905   Normal -5.7222085  0.6981735 -1.1034533
```

---

# PCA

```r
trainingpca$anomaly <- as.factor(trainingpca$anomaly)
m <- list(l = 0, r = 0, b = 0, t = 0)  # plot margins (assumed values)

fig <- plot_ly(trainingpca, x = ~PC2, y = ~PC3, z = ~PC1,
               color = ~anomaly, colors = c('#BF382A', '#0C4B8E'),
               width = 600, height = 350)
fig <- fig %>% add_markers(size = 2)
fig <- fig %>% layout(scene = list(xaxis = list(title = 'PC2'),
                                   yaxis = list(title = 'PC3'),
                                   zaxis = list(title = 'PC1')))
fig %>% layout(autosize = F, margin = m)
```
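The scree plot above shows variance per component; the cumulative version is often handier for judging how much of the data a three-component plot actually captures. A small sketch with illustrative proportions — in the deck these would be the `prop_varex` vector from the `prcomp` fit, not the made-up numbers below:

```r
# Illustrative variance proportions standing in for prop_varex;
# the real values come from the prcomp fit on the Spotify features
prop_varex <- c(0.35, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05)

cumsum(prop_varex)  # running total of variance explained

plot(cumsum(prop_varex), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")
```

If the first three components explain most of the variance, the 3D scatter above is a fair summary; if not, anomalies may separate along components the plot can't show.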
---

# Other Exploration

<img src="isoforests_files/figure-html/unnamed-chunk-17-1.png" width="600" />

---

# Score Density

<img src="isoforests_files/figure-html/unnamed-chunk-18-1.png" width="600" />

---

# Further Links/Reference

https://ggplot2.tidyverse.org/

---

# Thank you!

[www.stephaniekirmer.com](http://www.stephaniekirmer.com) | @[data_stephanie](http://www.twitter.com/data_stephanie) | [www.journera.com](http://www.journera.com)